Problem Set 2

Author

Your Name Here

Aedes food variability1

Zeller and Koella studied the effects of low vs high food levels at different stages of development on various life history traits in mosquitoes. Read in this dataset (Aedes_food.txt) and examine its structure. Make the column names more informative by changing ef to early_food and lf to late_food. Also change the values in these columns to low (currently coded as L) and high (currently coded as H). Add a single column containing all four combinations of early and late food (e.g., high_high, high_low…).

#FIXME

DD <- read_delim("../Data/Aedes_food.txt",
                show_col_types = FALSE) |> 
  rename(early_food = ef,
         late_food = lf) |> 
  mutate(early_food = if_else(early_food == "L", "low", "high"),
         late_food = if_else(late_food == "L", "low", "high")) |>
  mutate(EL_food = paste0(early_food, "_", late_food))

Two traits the authors are interested in are survival time and fecundity. The columns beginning with eggs are the number of eggs laid by each individual at each age. The column adultsurvival is the number of days each individual lived following emergence. Make a plot of adultsurvival vs eggs4 with color indicating early food and shape indicating late food.

Adjust the size of the points, transparency, and axis labels as you’d like.

#FIXME

DD |> 
  ggplot(aes(eggs4, adultsurvival, color = early_food, shape = late_food)) +
  geom_point(alpha = 0.75, size = 2) +
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number")
Warning: Removed 280 rows containing missing values (`geom_point()`).

Notice any warning messages. Compare your axis limits to the minimum and maximum values for your variables as well as the points plotted. Resolve any issues.

#FIXME

DD |>
  drop_na(any_of(c("eggs4", "adultsurvival"))) |>
  ggplot(aes(eggs4, adultsurvival, color = early_food, shape = late_food)) +
  geom_point(alpha = 0.75, size = 2) +
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number")

Most egg number values are less than 50 and most survival is above 30. Change the x and y limits to limit the range to these values.

#FIXME

DD |> 
  drop_na(any_of(c("eggs4","adultsurvival"))) |>
  ggplot(aes(eggs4, adultsurvival, color = early_food, shape = late_food)) +
  geom_point(alpha = 0.75, size = 2) +
  ylim(30, 60) +
  xlim(0, 50) +
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number")
Warning: Removed 2 rows containing missing values (`geom_point()`).

Observe any warnings. Note that if you decided to restrict your axis limits to exclude points, you would always need to justify and explain that decision in the figure legend and/or paper. Paying attention to and finding the reason for these warnings from ggplot should never be ignored (even if there are NA values). You should always know why it is warning you about the exact number of points.

Try using the single column indicating the four food combinations as a color aesthetic instead of both color and shape.

#FIXME

DD |> 
  drop_na(any_of(c("eggs4","adultsurvival"))) |>
  ggplot(aes(eggs4, adultsurvival, color = EL_food)) +
  geom_point(alpha = 0.75, size = 2) +
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number")

Which plot - that with both shape and color aesthetics or just the color aesthetic do you think communicates the pattern more effectively?

Let’s use the single column with four levels plot to add regression lines for each food treatment using geom_smooth with method = "lm".

#FIXME

DD |>
  drop_na(any_of(c("eggs4", "adultsurvival"))) |>
  ggplot(aes(eggs4, adultsurvival, color = EL_food)) +
  geom_point(alpha = 0.75, size = 2) +
  geom_smooth(formula = 'y ~ x', method = 'lm', se = FALSE)+
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number")

There are different point types in R you might want to use in some applications. For example, if you add pch = 21 you can add an outline to points to make them more defined. Try this with the plot you just made and adjust aesthetics as needed to make colored points with a black outline. Make sure your regression lines maintain their different colors (hint: you can use aes() within different geoms too).

#FIXME

DD |>
  drop_na(any_of(c("eggs4","adultsurvival"))) |>
  ggplot(aes(eggs4, adultsurvival, fill = EL_food)) +
  geom_point(alpha = 0.75, size = 2, pch = 21) +
  geom_smooth(formula = 'y ~ x',
              method = 'lm', se = FALSE, aes(color = EL_food))+
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number")

In many cases, using color aesthetics is the best approach with customization of which colors to use with packages and or functions like scale_color_manual() and this often helps avoid bugs. However, in some cases with complex multi-element plots, you might need to take full control of the color of your points or some other element. Let’s make this same plot using this method.

  1. Choose four colors from the R color chart or use a brewer to generate four colors.
  2. Use case_when() to make a column in your tibble that specifies one of these colors for each of the four combinations.
  3. Now use this new column as your fill aesthetic and observe what happens
#FIXME

DD |>
  drop_na(any_of(c("eggs4","adultsurvival"))) |>
  mutate("col_plot" = case_when(EL_food == "low_low" ~ "orange",
                                EL_food == "low_high" ~ "darkred",
                                EL_food == "high_low" ~ "lightblue",
                                EL_food == "high_high" ~ "steelblue")) |>
  ggplot(aes(eggs4, adultsurvival, fill = col_plot)) +
  geom_point(alpha = 0.75, size = 2, pch = 21) +
  geom_smooth(formula = 'y ~ x',
              method = 'lm',se = FALSE, aes(color = col_plot))+
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number")

You should have seen that your colors are just treated as categories and given an automatic color. Now tell R you want to use those exact colors by using scale_fill_identity() and/or scale_color_identity(). Note that you will not have a legend unless you specify breaks and labels to this function. You’ll see more of this when we look at customizing colors in plots more later.

#FIXME

p1 <- DD |>
  drop_na(any_of(c("eggs4","adultsurvival"))) |>
  mutate("col_plot" = case_when(EL_food == "low_low" ~ "orange",
                                EL_food == "low_high" ~ "darkred",
                                EL_food == "high_low" ~ "lightblue",
                                EL_food == "high_high" ~ "steelblue")) |>
  ggplot(aes(eggs4, adultsurvival, fill = col_plot)) +
  geom_point(size = 2, pch = 21) +
  geom_smooth(formula = 'y ~ x',
              method = 'lm', se = FALSE, aes(color = col_plot))+
  ylab("Adult Survival") +
  xlab("Age 4 Egg Number") +
  scale_fill_identity() +
  scale_color_identity()

p1

Check your color choices using one of the packages we showed you in lecture to test the accessibility of your plot for color blindness. Change your colors if needed.

#FIXME

cvd_grid(p1)

Let’s look at how the food treatments affected the size of the mosquitoes at age 4 (size.day4). Make a plot of age 4 size vs. early food and color the points by late food. Try out the following jitter values (0.1, 0.2, 0.3, 0.4, 0.5), and pick one that works well.

#FIXME

DD |>
  drop_na(any_of("size.day4")) |>
  ggplot(aes(early_food, size.day4, color = late_food)) +
  geom_point(alpha = 0.75, size = 2,
             position = position_jitter(width = 0.1,
                                        seed = 478221)) +
  ylab("Age 4 Size") +
  xlab("Early Food")

We can improve our plot by ordering the factors so low comes before high and dodging the points by late food as well. Again, play around with both the jitter and dodge widths (and alpha) to find values that work well for this plot.

#FIXME

DD |>
  mutate(early_food = factor(early_food, levels= c("low", "high")),
         late_food = factor(late_food, levels= c("low", "high"))) |>
  drop_na(any_of("size.day4")) |>
  ggplot(aes(early_food, size.day4, color = late_food)) +
  geom_point(alpha = 0.75, size = 2,
             position = position_jitterdodge(jitter.width = 0.2,
                                             dodge.width = 0.3,
                                             seed = 89879273)) +
  ylab("Age 4 Size") +
  xlab("Early Food")

Let’s add means to our plot. Make a tibble using group_by and summarize with the columns early_food, late_food, and size.day4 that holds the treatments and mean values, making sure the column names match what you’ve been plotting so far. Then add these means to your plot with another geom_point and make the points pretty big with size = 5.

#FIXME

s_means <- DD |>
  mutate(early_food = factor(early_food, levels= c("low", "high")),
         late_food = factor(late_food, levels= c("low", "high"))) |>
  group_by(early_food, late_food) |>
  summarise(size.day4 = mean(size.day4, na.rm = TRUE),
            .groups = "drop")

DD |>
  mutate(early_food = factor(early_food, levels= c("low", "high")),
         late_food = factor(late_food, levels= c("low", "high"))) |>
  drop_na(any_of("size.day4")) |>
  ggplot(aes(early_food, size.day4, color = late_food)) +
  geom_point(alpha = 0.75, size = 2,
             position = position_jitterdodge(jitter.width = 0.2,
                                             dodge.width = 0.3,
                                             seed = 89879273)) +
  geom_point(data = s_means, size = 5) +
  ylab("Age 4 Size") +
  xlab("Early Food")

Which aspects of your plot were automatically applied to your mean data and which were not? How does ggplot determine this?

It is difficult to see the points when they are the same color so pick a distinguishing color and set that for the means. Also make sure the points are aligned with the correct data (hint: if you use color as your grouping variable, you might need to also use group_by() so it is maintained when you change your colors for the mean points).

#FIXME

s_means <- DD |>
  mutate(early_food = factor(early_food, levels= c("low", "high")),
         late_food = factor(late_food, levels= c("low", "high"))) |>
  group_by(early_food, late_food) |>
  summarise(size.day4 = mean(size.day4, na.rm = TRUE),
            .groups = "drop")

DD |>
  mutate(early_food = factor(early_food, levels= c("low", "high")),
         late_food = factor(late_food, levels= c("low", "high"))) |>
  drop_na(any_of("size.day4")) |>
  ggplot(aes(early_food, size.day4, color = late_food)) +
  geom_point(alpha =0.75, size = 2,
             position = position_jitterdodge(jitter.width = 0.2,
                                             dodge.width = 0.3,
                                             seed = 89879273)) +
  geom_point(data = s_means, aes(group = late_food), size = 5,
             position = position_dodge(width = 0.3), color = "black") +
  ylab("Age 4 Size") +
  xlab("Early Food")

Bat calcars2

Stanchak et al. measured several skeletal aspects of bat limbs for over 200 species of bats. The bat calcar is a small cartilaginous projection from the ankle that helps to support the wing membrane. They hypothesized that the size of the calcar might be related to feeding ecology.

Read in the Bat_calcar.csv data. Make a new tibble that only includes bat families with at least 4 observations.

#FIXME

BB <- read_csv("../Data/Bat_calcar.csv",
              show_col_types = FALSE)

Family_counts <- BB |> 
  group_by(Family) |> 
  count() |> 
  filter(n > 3)

BB_filt <- BB |> filter(Family %in% Family_counts$Family)

To explore the relationship between calcar length and leg length, plot , Tib_mean (average tibia length) versus Cal_mean (average calcar length). Facet by bat family.

#FIXME

BB_filt |> 
  ggplot(aes(Tib_mean, Cal_mean)) +
  geom_point() +
  facet_wrap("Family")

The different bat families have fairly different ranges. Make this plot again but allow the axes to vary in each.

#FIXME

BB_filt |> 
  ggplot(aes(Tib_mean, Cal_mean)) +
  geom_point() +
  facet_wrap("Family", scales = "free")

Which of these sets of plots do you think captures the pattern more accurately?

Stickleback life history3

Ghani et al. performed a study looking at genetic differentiation in the timing of maturity in two stickleback populations using a laboratory breeding experiment where they raised different families of sticklebacks measuring the timing of their maturation. Read in the Stickleback_maturation.xlsx data and examine its structure.

#FIXME

SS <- read_excel("../Data/Stickleback_maturation.xlsx")

Make a plot of Maturation day versus Sex and color the points by family.

#FIXME

SS |>
  ggplot(aes(Sex, `Maturation day`, color = `Family across`)) +
  geom_point()

You probably noticed some issues with how the columns are coded (e.g., numeric vs factor). Alter these or create new columns to create the plot you want. Note that in this dataset, males are coded as 1 and females as 0.

#FIXME
SS |>
  mutate("Maturation_day" = as.numeric(`Maturation day`),
         "Family_fac" = as.factor(`Family across`),
         "Sex" = case_when(Sex == 0 ~ "Female",
                           Sex == 1 ~ "Male")) |>
  ggplot(aes(Sex, Maturation_day, color = Family_fac)) +
  geom_point()
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Maturation_day = as.numeric(`Maturation day`)`.
Caused by warning:
! NAs introduced by coercion
Warning: Removed 31 rows containing missing values (`geom_point()`).

Notice any warnings and investigate their source and resolve any issues so you don’t get a warning when you plot.

#FIXME
unique(SS$`Maturation day`)
 [1] "346" "348" "365" "362" "364" "345" "347" "349" "352" "375" "376" "363"
[13] "367" "361" "366" "360" "350" "351" "353" "372" "356" "357" "355" "368"
[25] "379" "370" "378" "373" "369" "NA"  "377" "381" "371" "374" "359" "354"
[37] "358" "385" "380" "384" "383" "386" "382"
SS_clean <- SS |>
  mutate("Maturation_day" = as.numeric(`Maturation day`),
         "Family_fac" = as.factor(`Family across`),
         "Sex" = case_when(Sex == 0 ~ "Female",
                           Sex == 1 ~ "Male"))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `Maturation_day = as.numeric(`Maturation day`)`.
Caused by warning:
! NAs introduced by coercion
SS_clean$`Maturation day`[which(is.na(SS_clean$Maturation_day))]
 [1] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
[16] "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA" "NA"
[31] "NA"
SS_clean |>
  drop_na(any_of("Maturation_day")) |>
  ggplot(aes(Sex, Maturation_day, color = Family_fac)) +
  geom_point()

Because we just want to see different families but the numbers are not meaningful, let’s drop the legend using theme. And with males and females labeled, we probably don’t need the x axis label either so drop that too.

#FIXME
SS_clean |>
  drop_na(any_of("Maturation_day")) |>
  ggplot(aes(Sex, Maturation_day, color = Family_fac)) +
  geom_point() +
  theme(legend.position = "none",
        axis.title.x = element_blank())

Footnotes

  1. Zeller, M., and J. C. Koella. 2016. Effects of food variability on growth and reproduction of Aedes aegypti. Ecol. Evol. 6:552–559.↩︎

  2. Stanchak, K. E., J. H. Arbour, and S. E. Santana. 2019. Anatomical diversification of a skeletal novelty in bat feet. Evolution 73:1591–1603.↩︎

  3. Ghani, N. I. A., G. Herczeg, T. Leinonen, and J. Merilä. 2013. Evidence for genetic differentiation in timing of maturation among nine-spined stickleback populations. J. Evol. Biol. 26:775–782.↩︎